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We demonstrate how to parallelize the density matrix renormalization group (DMRG) algorithm 
in real space through a straightforward modification of serial DMRG. This makes it possible to 
apply at least 10 times the computational power to challenging simulations, greatly accelerating 
investigations of two-dimensional systems and large parameter spaces. We discuss details of the 
algorithm and present benchmark results, including a study of valence-bond-solid order in the square- 
lattice Q2 model. 
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The density matrix renormalization group (DMRG) is 
a method for computing ground states of one-dimensional 
(Id) systems [IH3] which has recently proven surprisingly 
effective for studying two-dimensional (2d) models, espe- 
cially those beyond the reach of quantum Monte Carlo 
due to frustratated interactions or mobile fermions [U- 
H"2"] . DMRG is also valuable in quantum chemistry where 
it greatly extends the ability of previous methods to deal 
with strong correlation (T3TU5| . But applying DMRG 
to 2d and quantum chemical models is computationally 
very demanding. Even in Id, DMRG can be costly for 
systems having many degrees of freedom [TMT5] . Find- 
ing an efficient way to divide a single DMRG calculation 
across multiple processors would make larger calculations 
tractable, but a truly parallel DMRG algorithm has re- 
mained an outstanding problem. 

Previous DMRG implementations have achieved a lim- 
ited form of parallelism by parallelizing over different 
terms in the Hamiltonian 1!) . This is effective in a 
quantum chemistry context where the number of terms 
is especially large. However, since the terms contained 
entirely within a single spatial region are already com- 
bined together as much as possible within DMRG, the 
efficiency of this approach is less than ideal. A simi- 
lar type of limited parallelism subdivides matrices into 
quantum number subblocks |20[ 12 1] . But this approach 
is restricted by both a limited range of quantum numbers 
and a large variation in subblock size. The least powerful, 
but simplest form of parallelism breaks up dense matrix 
computations into subblocks and is performed automat- 
ically by standard linear algebra libraries. 

Here we present a much more powerful form of paral- 
lelism, dividing a single DMRG calculation over separate 
regions of the system in real space. Except at the bound- 
ary of each region, the algorithm reduces to standard 
finite-size DMRG, making it relatively straightfoward to 
implement in existing, highly-optimized DMRG codes. 
Real-space parallelism can be used independently of the 
other types discussed above and becomes increasingly ef- 
fective as the system size increases. In practice, we ob- 
serve close to ideal speedups as shown in Fig. [T] 

Some extensions of DMRG are already real-space par- 
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FIG. 1. Timing of parallel DMRG ground state calculations 
for the spin 1/2 Heisenberg model on the 24 x 8 square lattice 
(cylindrical boundary conditions [E]). A speedup of S indi- 
cates that the calculation using n nodes was S times faster 
than the same calculation using one node. Each calculation 
consisted of 10 sweeps and reached a relative energy accuracy 
of 10 -5 by keeping m = 2000 states in the final two sweeps. 



allelizable, such as time-dependent DMRG algorithms 
which rely on factorizations of the time-evolution op- 
erator, as in a Suzuki- Trotter approximation |22[ I23j . 
These algorithms can be used to compute ground states 
through imaginary time evolution. But imaginary time 
evolution is much less efficient than the usual DMRG 
diagonalization-based method, and is therefore primarily 
useful when the standard DMRG ground state approach 
is not applicable (for example, when optimizing 2d tensor 
networks such as PEPS [Ml [23]). 

To describe the algorithm, first consider parallelizing 
a single DMRG calculation over just two regions. For 
concreteness, take the system to have six sites such that 
the center bond connects sites 3 and 4. As a warmup, 
first perform a few sweeps of the standard, serial finite- 
size DMRG algorithm [IHS] keeping only a small number 
of states so this non-parallel part takes little time. Stop 
when the two exposed sites reach the center bond as in 
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FIG. 2. The parallel DMRG algorithm for the simplest case of dividing a single calculation over two machines. Circles 
represent lattice sites and the diamond is the matrix V on the shared bond as in Eq. Q. In step (a), the local information 
needed to sweep left is copied to the left machine and similarly for the right. Each machine then performs a DMRG sweep 
(b) in parallel over its half of the system. In step (c) the wavefunctions are merged together using the prescription preceding 
Eq. ([5]). Before repeating the algorithm, the merged state is optimized (d) using Lanczos or Davidson on the shared bond. 



Fig. [fa). 

At this point the wavefunction within DMRG has the 
form 

l*>= E * a2S3Siai \*2)M\s4)\a 4 ) R (1) 

where S3, S4 = 1, . . . , d label the lattice basis on sites 3 
and 4, and «2, a 4 — 1, . . . , m label orthonormal many- 
body states approximating |$) within the left and right 
blocks, respectively. 

DMRG proceeds in two steps: first, this wavefunction 
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FIG. 3. Prediction fidelity 1-(*^|*„) at the center (shared) 
bond of a DMRG calculation parallelized across two nodes. 
|^n) is the initial merged wavefunction following sweep n as 
in Eq. q5p . \^„) is the wavefunction resulting from a fully con- 
verging Davidson calculation initialized with \9' n ). The sys- 
tems are, from top to bottom, the 200 site S = 1/2 Heisenberg 
chain without warmup sweeps (dashed blue circles); S — 1/2 
chain with warmup sweeps (solid blue circles); S = 1 Heisen- 
berg chain without (dashed black squares) and with warmup 
sweeps (solid black squares). The warmup consisted of 5 se- 
rial DMRG sweeps keeping m = 50 states. For the parallel 
sweeps, m was increased after every other sweep to a maxi- 
mum of 600. 



is optimized using a few Lanczos or Davidson steps with 
the Hamiltonian projected into the |a2)_L|s3)|s4)|a!4).R 
basis [3] . Then a renormalization group procedure is car- 
ried out based on a singular value decomposition (SVD) 
of the amplitudes W 

v r ( (a2S3)(s4Q4) _ A(a 2 s 3 ) a a 3 o(s 4 a 4 ) /n) 

/ * c*3 as ' V / 

013 

(with ^ temporarily treated as an (md) x (dm) matrix). 
Following the SVD, all but the largest m singular values 
are truncated. 

In the serial DMRG algorithm, one next selects either 
the transformation matrix B to grow the right block, 
in which case the product AA is used to compute the 
wavefunction amplitudes on the bond to the left, or one 
chooses the transformation matrix A to grow the left 
block and KB to form the next wavefunction ampli- 
tudes. Iterating these steps generates a sweeping pro- 
cedure which improves the wavefunction at every bond 
sequentially. 

Now assume we have two computers able to work in 
parallel. Following the SVD, the first machine could 
sweep left while the second simultaneously sweeps right. 
However this soon leads to a conceptual issue: each com- 
puter will be working with a different wavefunction glob- 
ally. One can temporarily ignore this problem, but when 
the computers meet again it will be unclear how to merge 
their wavefunctions to perform a DMRG step on their 
shared bond. 

To overcome this inconsistency, rewrite Eq. (12]) but 
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now insert the identity AV — 1, where V — A 1 : 

* aj " s,4a4 = ^ A^ S3 A° 3 V; 3 A a3 5*f 4 (3) 

d ^J2 ^ 2Saa3 V a3 ^4 3S4 " 4 • (4) 

Again both machines can sweep in parallel, but when 
they return to their shared bond — the left with updated 
amplitudes ip' 3 and the right with amplitudes ■04 — there 
is a consistent way to merge the two wave functions. To 
define the merged wavefunction, take the DMRG ba- 
sis states \a^) l for the left from the machine sweeping 
over that region; the states |a4)_R for the right from the 
other machine; and for the amplitudes at the center bond 
choose 

*' = ^ V 3 V 4 , (5) 

similarly to Eq. using the original V. Note that 
the exact ground state is a fixed point of this procedure. 
After merging the two wavefunctions for optimization on 
the shared bond, the merged wavefunction can again be 
split if more parallel sweeps are needed. 

Though there is no communication between machines 
prior to each merge, in practice we find that Eq. ^ pro- 
vides a good initial state for the Lanczos or Davidson 
steps on the shared bond, as shown in Fig. [3j How- 
ever, because each machine otherwise updates the wave- 
function independently, DMRG convergence is typically 
slower at the shared bond, especially near the beginning 
of a calculation. Although this means parallel DMRG 
gives slightly worse results compared to serial DMRG for 
the same number of sweeps, the nearly ideal speedup in 
calculation time more than compensates for this effect. 

Though the discussion above focuses on the wave- 
function, an important part of DMRG is transforming 
any projected operators (such as the Hamiltonian) while 
sweeping. First, as each machine sweeps away from the 
shared bond in parallel, this transformation occurs in 
the usual way. For example, the matrix B from Eq. ^ 
transforms the Hamiltonian into the local basis of the 
next pair of sites to the left. Later when the two ma- 
chines merge their wavefunctions, they merge operators 
in an analogous way: operator terms acting in the left 
half of the system are approximated by their projection 
into the basis states from the left machine and similarly 
for the right. 

The role of the matrix V in the merge can be under- 
stood by observing that DMRG approximately preserves 
the reduced density matrix over regions where it does 
not sweep. Assuming that the wavefunction does not 
change too drastically after each sweep, the matrix V 
normalizes ip'^ on the right such that it approximates the 
reduced density matrix eigenstates on the right half of 
the system, and similarly for the left. Note the resem- 
blance of Eq. ^ to the prediction step in the infinite 
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FIG. 4. Sweeping pattern for one full sweep of the parallel 
DMRG algorithm split over four computational nodes. First, 
(a) the nodes are positioned in a spatially staggered pattern 
and sweep to the other end of their block. When the nodes 
reach the end of their block (b) they wait for their neighboring 
node to arrive then communicate. Finally the nodes sweep 
back (c) to their starting positions and (d) communicate with 
their other neighbor. 



DMRG algorithm of Ref. [26j which directly motivated 
the present work. From an matrix product state point 
of view, Eq. {§} is a transformation from a gauge having 
only one orthogonality center (a tensor whose indices all 
label orthonormal states) to a gauge with two orthogo- 
nality centers. 

The two-block algorithm described above can be read- 
ily extended to n real-space blocks by first splitting the 
system in two, then further splitting each new sub-block 
until there are n total. This suggests the sweeping pat- 
tern illustrated in Fig. [4] Odd numbered nodes start on 
the left side of their block and even nodes on the right. 
This way, when a node reaches the end of its block the 
neighboring node is ready to communicate. If a node 
reaches the end of its block before its neighbor arrives, 
it is better for the node to wait instead of immediately 
beginning the next half sweep. Having an updated envi- 
ronment far outweighs the loss in efficiency due to a node 
briefly remaining idle. 

To demonstrate that real-space parallel DMRG can 
be used to accelerate very challenging two-dimensional 
DMRG calculations, we use it to study the S = 1/2, 
pure Q2 model on the square lattice. This model has 
been proposed as a benchmark for testing the predictive 
ability of DMRG for 2d systems [57] . Extensive quantum 
Monte Carlo (QMC) calculations show the model is in a 
phase with weak columnar valence bond solid (VBS) or- 
der [27H29] . Because DMRG is limited to much smaller 
finite-size systems than QMC, a reasonable concern is 
that DMRG could miss such weak order and possibly 
mistake the phase for a spin liquid. 
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The Hamiltonian of the pure Q2 model is defined as 
H = -Q 2 ( S * ' S J - V4)(S fc • S, - 1/4) (6) 

(ij;kl) 

where the sum is over all pairs of bonds (i,j) and (k,l) 
on opposite sides of the elementary square plaquettes. To 
investigate the columnar VBS order following Ref. [27J we 
study finite-size systems of width L in the y direction and 
length 2L in the x direction. For each system size, we 
measure the order parameter 

(D x ) — — y~^(S r • S r+X ) — -(S r _j • S r ) — — (S r _|_^ • S r+2 x) 
v 

(7) 

averaged over all r = (L, y) at the central column of the 
system. 

In the shorter y direction we take periodic boundary 
conditions but use open boundary conditions in the x di- 
rection for technical reasons. In contrast to Ref. [27J how- 
ever, we include extra Hamiltonian terms at the edges to 
pin columnar VBS order. Following the strong-pinning 
prescription of Ref. 1121 we imagine fictitious spins just 
beyond the edge of the system locked into an ideal colum- 
nar VBS in the x direction. Tracing over these fictitious 
spins induces a term 

-ffpin,r = ~^~(Sr ' Sr+y — 1/4) (8) 

on each vertical bond along the edges of the real system. 
The combination of these terms and a 2 : 1 aspect ratio 
helps to control finite-size effects. 



We carried out parallel DMRG calculations for cylin- 
ders of width L = 4, 6, 8 and 10 and show the result- 
ing (D x ) order parameter values in Fig. ^ Each calcu- 
lation was parallelized over four real-space blocks with 
each block assigned to a separate 8-core Intel Harper- 
town 2.66 GHz node. The largest calculation — keeping 
up to m — 3000 states for the L — 10 system — took 6 
days and would have taken 18-21 days without paral- 
lelization. All calculations could have easily been further 
parallelized with access to more nodes. 

With our choice of aspect ratio and boundary condi- 
tions, we found only a weak dependence of the order pa- 
rameter on system size for large enough L. By averaging 
(D x ) for the largest three systems, we estimate a value 
(D x ) — 0.078(6) for the thermodynamic limit, in good 
agreement [3^ with the value 0.078 predicted by quan- 
tum Monte Carlo [271 P- 10]. 

Looking ahead, we expect to see real-space paral- 
lelism become a standard tool for accelerating challenging 
DMRG calculations since it can be implemented in ex- 
isting codes. We also hope this work encourages authors 
of DMRG-related papers to identify the parallel aspects 
of their methods even more prominently. 

This paper is dedicated to Ernie Compton, a gifted sci- 
ence teacher and inspiring role model. We thank Thomas 
Barthel, Bela Bauer, Bryan Clark, Stefan Depenbrock, 
Adrian Feiguin, Hong-Chen Jiang, Salvatore Manmana, 
Ian McCulloch, Ulrich Schollwock, Guifre Vidal, and 
Zhenyue Zhu for helpful discussions. This work is sup- 
ported by NSF grant DMR-1161348. 
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FIG. 5. Columnar VBS order parameter (D^) on open cylin- 
ders of the square-lattice Q2 model Eq. (JsJ with strong- 
pinning boundary conditions |12) . Each order parameter 
value was estimated by increasing the states kept m after 
every other sweep and extrapolating in the truncation error. 
Error bars are smaller than symbol sizes except for the L = 10 
system. 



[1] S. R. White, |Phys. Rev. Le tt., 69, 28 63 (1992)| S. White, 

|Phys. Rev. B, 48, 10345 (1993)| 
[2] U. Schollwock, |Rev. Mod. Phys', 77, 259 (2005)] 
[3] R. M. Noack and S. R. M anmana, |AIP Conference Pro- 
ceedings, 789, 93 (2005)1 



[4j S. R. White a nd A. Chernyshev, |Phys. Rev. Lett., 99, 



127004 (2007) 



[5j H.-C. Jia~rIg~Z.-C. Gu , X.-L. Qi, and S. Trebst, [PrrPsT 
Rev. B, 83, 245104 (2011)1 



[6] S. Yan DTA. Huse, and S. R. White, [Science, 332, 1173 
(2011)[ 

[7] P. Corboz, S. R. White, G . Vidal, and M. Troyer, [PlysT 
Rev. B, 84, 041108 (2011)1 

|8| H.-C. Jiang, H . Yao, and L. Balents, |Phys. Rev. B, 86, 



024424 (2012) 

[9J B. Bauer, P. Corboz, A. M. Lauchli, L. Messio, K. Penc, 
M. Troyer, and F. Mila, |Phys. Rev. B, 85, 125116 (2012)| 
[10] S. Depenbrock, I. P. McCulloch, and U. Schollwock, 



|Phys. Rev. Lett., 109, 067201 (2012) 






[11] L. Cincio and G. Vidal, arxiv: 1208.2623 (2012) 


[12] E.M. Stoudenmire and S. R. White, 


Annual Review of 


Condensed Matter Physics, 3, 111 (2012) 


[13] Y. Kurashige and T. Yanai,|J. Chem. Phys., 135, 094104 



(2011)| 



[14] G. K.-L. Chan and S. Sharma, Annual Review of Physical 



[15] 

EH 
ttti 



J. Chem. Phys., 136 



Chemistry, 62, 465 (2011)] 
S. Sharma and G. K.-L. Chan 
124121 (2012)] 
J. Hachmann, W. Cardoen, and G. K.-L. Chan, The 
Journal of Chemical Physics, 125, 144101 (2006)] 



M. Dolfi, B. Bauer, M. Troyer, and Z. Ristivojevic, Phys 
Rev. Lett., 109, 020604 (2012)] 
[18j E. M. St oudenmire, L. O. Wagner, S. R. Whit e, and 

K. Burke, |Phys. Rev. Lett., 109, 056402 (2012)] 
[19] G. K.-L. Chan, |J. Chem. Phys., 120, 317 2 (2004) 
[20] G. Hager, E. Jeckelmann, H. Fehske, and G. Wellein, 

|Journal of Computational Physics, 1 94, 795 (2004)] 

[21] Y. Kurashige and T. Yanai, [The Jour nal of Chemical| 

Physics, 130, 234114 (2 009) 
[22] A. Daley, C. Kollath , U. Schollwock, and G. Vidal, [X] 



| Stat. Mech., 04, 005] 
[23] S. R. White and A. E. Feiguin, |Phys. Rev. Lett., 93 



076401 (2004) 
Y. 



|24| Z. Y. Xie H. C. Jiang, Q. N. Chen, Z. Y. Weng, 
T. Xiang, |Phys. Rev. Lett., 103, 160601 (2009)] 



and 



[25] I. Pizorn, L. W ang, and F. Verstraete, Phys. Rev. A, 83, 
052321 (20TT)1 

[26] I. P. McCulloch |arxiv:0804.2509 (2008)| 

[27] A. W. Sandvik, Phys. Rev. B, 85, 134407 (2012) 
[28] A. W. Sandvik, Phys. Rev. Lett., 98, 227202 (2007) 
[29] A. W. Sandvik, [Phys. Rev. Lett., 104 177201 (2010)| 
[30] Our calculations only agree up to a factor of precisely 
two. Up to this factor, we have successfully reproduced 
the results (not shown) of Ref. 1271 when omitting the edge 
pinning terms. We therefore report twice the value stated 
in Ref. [27] (p. 10) as the QMC estimate of the 2d order 
parameter. 



